llama + spec: MTP Support #22673

Open
am17an wants to merge 19 commits into ggml-org:master from am17an:mtp-clean

Conversation

am17an (Contributor) commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline. The design decisions I took to get to this stage are as follows:

Next Steps

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}
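
(Rough arithmetic on the run above: each verification pass emits one token sampled by the target plus the accepted drafts, so 952 accepted out of 1406 predicted tokens works out to roughly 1406 / (1406 - 952) ≈ 3.1 tokens per target pass; against the 201.07 s baseline this gives a wall-clock speedup of about 201.07 / 83.8 ≈ 2.4x, the gap to ~3.1x coming from the cost of running the MTP head and of rejected drafts.)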

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with --spec-draft-n-max 16 and partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing, as well as for the convert_hf_to_gguf.py changes and model definitions, and for writing the bench used for validation against vLLM.

github-actions bot added the model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), examples, python (python script changes), server, and ggml (changes relating to the ggml tensor library for machine learning) labels on May 4, 2026
ngxson (Contributor) commented May 4, 2026

Nice, I think this is a better fresh start than my WIP #18886 (which I still haven't found the time to continue).

There were some other attempts to add MTP support, but they all heavily rely on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

ngxson (Contributor) left a comment:

(not a review, but opening some discussions)

Comment thread: src/llama-memory-recurrent.h
Comment thread: src/models/qwen35.cpp (Outdated)

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
ngxson:

Nit, but maybe call it n_main_layers, as technically the NextN layer is also a transformer layer.

Comment thread: tools/server/server-context.cpp (Outdated)
Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
ngxson:

If you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type.

am17an (Contributor, Author):

Yes, that seems like the correct way to do this if we want to support MTP in a generic way.
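
To make that direction concrete, a rough sketch of what the caller side could look like if the graph/context type were exposed publicly (the commit list below shows the PR later moved to llama_context_type; the field ctx_type and the value LLAMA_CONTEXT_TYPE_MTP here are placeholders, not actual llama.cpp API):

// Sketch only: share one llama_model between the main context and an MTP context,
// instead of re-loading the model file with override_arch as in the snippet above.
// `ctx_type` / LLAMA_CONTEXT_TYPE_MTP are hypothetical placeholders.
static llama_context * make_mtp_context(llama_model * model, uint32_t n_ctx) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = n_ctx;
    cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP; // hypothetical: selects the MTP/NextN graph

    // both contexts reuse the weights of `model`, so only the MTP head's KV cache adds memory
    return llama_init_from_model(model, cparams);
}

The server would then keep a single llama_model and create two contexts over it, avoiding the second llama_model_load_from_file call.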

am17an (Contributor, Author) commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts.

pwilkin (Member) commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

cmp-nct (Contributor) commented May 4, 2026

In my opinion, Qwen 3.6 is the most important thing to happen in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting? So MTP is used until ngram is triggered, switching to ngram until rejection and then back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing), and MTP fills the gap.

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even with very small draft models, for that exact reason: they need their own context and KV cache. Such low- to mid-range systems already operate on the edge in terms of memory.

mbednarek360:

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

am17an (Contributor, Author) commented May 4, 2026

@cmp-nct I'm not sure, but could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of overall memory used (it's just a single-layer transformer + KV cache, much lighter than draft models)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan for now; if you ask an LLM to add support for that you might get a little further. (Vulkan and Metal have also been tested now.)

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., possibly preventing some degree of resource contention.

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this becoming stable. Here are automated test results from my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

mode                      model                                                               prefill tok/s avg   generation tok/s avg   MTP acceptance   loaded VRAM
MTP enabled               Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3    665.14              42.45                  76.0%            24.96 GiB
MTP disabled, same GGUF   Qwen3.6-27B-MTP-Q6_K.gguf, no spec                                  1315.46             22.97                  n/a              22.47 GiB
Existing non-MTP Q6       Qwen3.6-27B-Q6_K.gguf, no spec                                      1260.12             22.39                  n/a              22.59 GiB

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

am17an (Contributor, Author) commented May 4, 2026

@cturan Thanks for testing, I'm aware of the prefill issue and will work on a fix.

iiLaurens:

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)
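
Not something this PR provides, but as a rough sketch of where such a merge could start, here is how one might inspect the donor GGUF with ggml's gguf C API (the file name reuses the one from the commands above, the qwen3moe.* key prefix is a guess, and an actual merge would still need to copy tensor data and rewrite the metadata, including kv_count):

#include <cstdio>
#include <cstring>
#include "gguf.h"

int main() {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file("qwen3.6-q8_0-mtp.gguf", params); // example path
    if (ctx == NULL) {
        return 1;
    }

    // list the tensors that belong to the MTP/NextN block (blk.40 in the quant discussed above)
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        const char * name = gguf_get_tensor_name(ctx, i);
        if (strncmp(name, "blk.40.", 7) == 0) {
            printf("MTP tensor: %s\n", name);
        }
    }

    // block_count is the top-level key that would need bumping; the arch prefix is a guess
    const int64_t kid = gguf_find_key(ctx, "qwen3moe.block_count");
    if (kid >= 0) {
        printf("block_count = %u\n", gguf_get_val_u32(ctx, kid));
    }

    gguf_free(ctx);
    return 0;
}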

volkermauel:

Only a quick test run: 1x 5090, qwen3.6-27b, MTP 3, q4_0 quantized, KV cache also q4_0.

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great Speedup!

alexandrupetraru:

This is a game changer: on Strix Halo with the q8 Qwen 3.6 35BA3B, TG jumps from 40 to 70 at low context, and for the 27B from 12 to 25 TG (with a 50/50 layer split across a 7900 XTX and Strix Halo) for coding. We need this one merged to master ASAP together with turbo4; it performs very well and without any issues. Good job!

GloballyUniquePlaceholder:

On a 3060 Laptop (6GB VRAM) + 64GB RAM, running your provided Qwen 3.6 35BA3B GGUF, there is a reasonable speed up.

spec-draft-n-max   average tok/s   wall_s_total   aggregate_accept_rate
n/a (no MTP)       22.92           77.69          n/a
1                  27.58           68.34          0.8835
2                  29.39           66.00          0.815
3                  27.78           67.96          0.7127
4                  26.09           72.23          0.6421

Raw results:

spec-draft-n-max 4

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 4

python mtp-bench.py
  code_python        pred= 192 draft= 180 acc= 146 rate=0.811 tok/s=31.3
  code_cpp           pred= 192 draft= 216 acc= 136 rate=0.630 tok/s=22.7
  explain_concept    pred= 192 draft= 224 acc= 134 rate=0.598 tok/s=22.3
  summarize          pred=  53 draft=  52 acc=  39 rate=0.750 tok/s=33.3
  qa_factual         pred= 192 draft= 196 acc= 141 rate=0.719 tok/s=29.2
  translation        pred=  22 draft=  32 acc=  13 rate=0.406 tok/s=19.4
  creative_short     pred= 192 draft= 264 acc= 124 rate=0.470 tok/s=20.7
  stepwise_math      pred= 192 draft= 192 acc= 143 rate=0.745 tok/s=30.7
  long_code_review   pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=25.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1576,
  "total_draft_accepted": 1012,
  "aggregate_accept_rate": 0.6421,
  "wall_s_total": 72.23
}

spec-draft-n-max 3

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

python mtp-bench.py
  code_python        pred= 192 draft= 165 acc= 136 rate=0.824 tok/s=30.2
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=27.6
  explain_concept    pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=25.3
  summarize          pred=  53 draft=  48 acc=  36 rate=0.750 tok/s=32.5
  qa_factual         pred= 192 draft= 180 acc= 131 rate=0.728 tok/s=29.2
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=24.5
  creative_short     pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=23.2
  stepwise_math      pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=30.5
  long_code_review   pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=27.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1347,
  "total_draft_accepted": 960,
  "aggregate_accept_rate": 0.7127,
  "wall_s_total": 67.96
}

spec-draft-n-max 2

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

python mtp-bench.py
  code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=31.5
  code_cpp           pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=27.0
  explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=25.6
  summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=32.2
  qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.1
  translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=30.8
  creative_short     pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=25.9
  stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.3
  long_code_review   pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=29.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1070,
  "total_draft_accepted": 872,
  "aggregate_accept_rate": 0.815,
  "wall_s_total": 66.0
}

spec-draft-n-max 1

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 1

python mtp-bench.py
  code_python        pred= 192 draft=  96 acc=  94 rate=0.979 tok/s=28.3
  code_cpp           pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=26.2
  explain_concept    pred= 192 draft= 102 acc=  89 rate=0.873 tok/s=25.9
  summarize          pred=  56 draft=  29 acc=  26 rate=0.897 tok/s=30.6
  qa_factual         pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=28.5
  translation        pred=  22 draft=  12 acc=   9 rate=0.750 tok/s=27.0
  creative_short     pred= 192 draft= 104 acc=  86 rate=0.827 tok/s=24.9
  stepwise_math      pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.7
  long_code_review   pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1422,
  "total_draft": 747,
  "total_draft_accepted": 660,
  "aggregate_accept_rate": 0.8835,
  "wall_s_total": 68.34
}

no mtp

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}"

python mtp-bench.py
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.2
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=25.9
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=22.3
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=21.4
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.0
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 77.69
}

ninjas28 commented May 5, 2026

Crashes when using -sm tensor. llama-server launch command args: -hf am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 -sm tensor -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3. Using -sm tensor without MTP works fine. This is on a triple-GPU setup using ROCm.

srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 356
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 352, batch.n_tokens = 352, progress = 0.988764
/root/llama.cpp/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
/root/llama.cpp/build/bin/libggml-base.so.0(+0x1b25b)[0x74b4b4ca925b]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)[0x74b4b4ca96df]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x152)[0x74b4b4ca98b2]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41506)[0x74b4b4ccf506]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x3d579)[0x74b4b4ccb579]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41adb)[0x74b4b4ccfadb]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x474)[0x74b4b4cbff54]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111)[0x74b4b4cc6351]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe8)[0x74b4b44dac08]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context21handle_mtp_for_ubatchEiPKiS1_P11ggml_tensor+0x20d)[0x74b4b44da9bd]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x142)[0x74b4b44dac62]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
llama-server(+0xf846e)[0x63c5e42c046e]
llama-server(+0x172971)[0x63c5e433a971]
llama-server(+0x5842c)[0x63c5e422042c]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74b4b3c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74b4b3c29e40]
llama-server(+0x58cd5)[0x63c5e4220cd5]
Aborted

superjamie:

Tested on 3x RTX 3060 12GB. Sorry, I don't have the VRAM for your Q8; I used RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF, which was quantized with ik_llama's MTP.

Prompt: "Write a simple minimal hash table implementation in C99."

Three runs with no MTP, avg generation 18.51 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}'

prompt eval time =     177.62 ms /    24 tokens (    7.40 ms per token,   135.12 tokens per second)
       eval time =   99331.08 ms /  1837 tokens (   54.07 ms per token,    18.49 tokens per second)
      total time =   99508.70 ms /  1861 tokens

prompt eval time =     159.10 ms /    24 tokens (    6.63 ms per token,   150.85 tokens per second)
       eval time =  107505.42 ms /  1988 tokens (   54.08 ms per token,    18.49 tokens per second)
      total time =  107664.52 ms /  2012 tokens

prompt eval time =     158.43 ms /    24 tokens (    6.60 ms per token,   151.49 tokens per second)
       eval time =   48263.07 ms /   895 tokens (   53.93 ms per token,    18.54 tokens per second)
      total time =   48421.51 ms /   919 tokens

Three runs with MTP, avg generation 32.24 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}' \
 --spec-type mtp --spec-draft-n-max 3 --parallel 1

prompt eval time =     232.24 ms /    24 tokens (    9.68 ms per token,   103.34 tokens per second)
       eval time =   34610.94 ms /  1110 tokens (   31.18 ms per token,    32.07 tokens per second)
      total time =   34843.18 ms /  1134 tokens 
      
prompt eval time =     207.99 ms /    24 tokens (    8.67 ms per token,   115.39 tokens per second)
       eval time =   32110.05 ms /  1064 tokens (   30.18 ms per token,    33.14 tokens per second)
      total time =   32318.03 ms /  1088 tokens
      
prompt eval time =     208.50 ms /    24 tokens (    8.69 ms per token,   115.11 tokens per second)
       eval time =   39029.34 ms /  1230 tokens (   31.73 ms per token,    31.51 tokens per second)
      total time =   39237.84 ms /  1254 tokens 

Result: a 74% speedup. Wow!

Thank you for your work. You will make many users happy with this. What an exciting PR!

One small hiccup. On my initial attempt I got the error message:

load_model: MTP currently supports only n_parallel=1; got 4

Adding --parallel 1 fixed that.

i386 commented May 14, 2026

I've gone ahead and implemented Metal backend support for this: am17an#10

ggerganov (Member):
@pepedombo Could you try bumping the batch size to -b 8192 -ub 512 and see if it helps with the PP?

pepedombo:

> @pepedombo Could you try bumping the batch size to -b 8192 -ub 512 and see if it helps with the PP?

Already tried various batch sizes and it simply stays at a constant speed.

prompt eval time =   27609.50 ms / 18189 tokens (    1.52 ms per token,   658.79 tokens per second)
       eval time =    1852.60 ms /    63 tokens (   29.41 ms per token,    34.01 tokens per second)
      total time =   29462.10 ms / 18252 tokens

Evals are from Qwen Code. Without MTP I get ~1300-1400 PP and 22-26 TG.

ggerganov (Member):
Ok, thanks for the info. We'll focus on the PP improvements after the merge.

DenysAshikhin:

With Vulkan and a25be1b, there seems to be an error when trying to combine MTP and ngram. Not sure if this is supposed to work here yet, but just leaving it here in case it is:

init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 2065
 - the tokens for sequence 0 in the input batch have a starting position of Y = 2010
 for M-RoPE, it is required that the position satisfies: X < Y
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 0, error: Invalid input batch.

Command:

llama-server -hf am17an/Qwen3.6-35BA3B-MTP-GGUF --host 0.0.0.0 --port 8080 --no-mmap --fit off --spec-type draft-mtp,ngram-mod --spec-draft-n-max 3 --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --parallel 1

I apologise if this isn't the place to ask, but how exactly does draft + ngram work?
--spec-draft-n-max 3 -> means the MTP heads will generate 3 tokens.
--ngram params -> in this case, look at the last 24 tokens, generate at least 48 (if it can), up to a max of 64.

Then, in practice, does it mean it will first try ngram acceptance, and only if that fails (meaning we didn't get 48-64 tokens accepted) does it try the MTP approach?

Neppord commented May 14, 2026

When I try this branch I get an error during compile (a missing ";"), and after fixing that I get this error when trying to run:

./build/bin/llama-server --host 0.0.0.0 --port 44444 -hf unsloth/Qwen3.6-27B-MTP-GGUF --no-mmproj --alias qwen --reasoning on -c 8192 -ngl 99 -fa on --jinja -np 1 -b 8192 -ub 1024 --cache-type-k q4_0 --cache-type-v q4_0

[...lots of output...]

.../llama.cpp/src/llama-memory-recurrent.cpp:173: GGML_ASSERT(rollback >= 1 && rollback <= (llama_pos) n_rs_seq) failed

[...more output...]

I'm running an Intel Arc Pro B70; I have also tried merging in master to see if that helped, with no real change. I'm using the SYCL backend.

am17an and others added 18 commits May 14, 2026 22:19
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Currently speculative decoding needs to restart from a checkpoint
after some draft tokens are not accepted, and this leads to some wastage in
running the target again. This PR adds the ability to roll back up to
`draft_max` by storing the GDN intermediates.
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi
Neppord commented May 14, 2026

Wow, I'm so impressed by your work!

* server : adjust checkpoint logic

* cont : rm asserts
unbug commented May 14, 2026

Please keep SM70 (V100) supported.
